2024-04-23
OK: you ran a regression (fit a linear model) and some of your variables are log-transformed. How should you interpret the coefficients?
Source is here for further reference (also see the Class 25 README).
If only the dependent (outcome) variable is log-transformed: exponentiate the coefficient. This gives the multiplicative factor for the dependent variable for every one-unit increase in the independent variable.
from this source
If only the independent variable is log-transformed: divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units.
Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002.
For an x percent increase in the independent variable, multiply the coefficient by log(1 + x/100).
Example: for every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) ≈ 0.019.
from this source
If both variables are log-transformed: interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable.
from this source
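The three rules above can be checked with a few lines of base R, using the example coefficient of 0.198 from the slides:

```r
b <- 0.198

exp(b)          # outcome log-transformed: multiplicative factor per 1-unit increase
b / 100         # predictor log-transformed: change in outcome per 1% increase
b * log(1.10)   # predictor log-transformed: change in outcome per 10% increase
```

The last line evaluates to about 0.019, matching the worked example.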
A brief introduction to …
Examples:
BigCities <- world_cities |>
arrange(desc(population)) |>
head(4000) |>
select(longitude, latitude)
glimpse(BigCities)
Rows: 4,000
Columns: 2
$ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.397…
$ latitude <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750,…
You can visit https://www.tidymodels.org/learn/statistics/k-means/ for a brief explanation of the clustering process using an animation from Allison Horst. Here, I’ll look at the pieces one slide at a time.
Allison’s materials are available at https://github.com/allisonhorst/stats-illustrations/tree/master/other-stats-artwork
pen <- penguins |> ## from palmerpenguins package
select(bill_length_mm, bill_depth_mm, species) |>
drop_na()
glimpse(pen)
Rows: 342
Columns: 3
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 3…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 1…
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie,…
We know there are three penguin species included, so let's try \(k = 3\).
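A sketch of the call that could produce the output below. The scaled matrix name `pen1` appears later in these slides; the random seed here is an assumption (the slides' seed is not shown), so cluster labels and sizes may differ:

```r
library(palmerpenguins)  # penguins data
library(dplyr)
library(tidyr)

pen <- penguins |>
  select(bill_length_mm, bill_depth_mm, species) |>
  drop_na()

pen1 <- pen |> select(-species) |> scale()  # standardize both measurements

set.seed(432)                      # assumed seed
pk3 <- kmeans(pen1, centers = 3)   # base R's stats::kmeans
pk3
```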
K-means clustering with 3 clusters of sizes 153, 64, 125
Cluster means:
bill_length_mm bill_depth_mm
1 -0.9431819 0.5595723
2 1.1018368 0.7985421
3 0.5903143 -1.0937700
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
[186] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3
[223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3
[260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 1 2
[297] 3 2 2 2 2 2 2 2 1 2 1 2 2 3 2 2 3 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 1 2 3 2 2
[334] 2 2 3 3 2 1 2 2 2
Within cluster sum of squares by cluster:
[1] 87.98608 39.03381 59.35481
(between_SS / total_SS = 72.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
What do tidy() and glance() do?
# A tibble: 3 × 5
bill_length_mm bill_depth_mm size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 -0.943 0.560 153 88.0 1
2 1.10 0.799 64 39.0 2
3 0.590 -1.09 125 59.4 3
What does augment() do?
Rows: 342
Columns: 4
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 3…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 1…
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie,…
$ .cluster <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
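The displays above come from broom's three verbs applied to the fitted k-means object. A self-contained sketch (the object names and the seed are assumptions):

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)
library(broom)

pen <- penguins |>
  select(bill_length_mm, bill_depth_mm, species) |>
  drop_na()

set.seed(432)  # assumed seed; results may differ with another seed
pk3 <- kmeans(scale(pen |> select(-species)), centers = 3)

tidy(pk3)          # one row per cluster: centers, size, within-cluster SS
glance(pk3)        # one row: totss, tot.withinss, betweenss, iter
augment(pk3, pen)  # the original rows plus a .cluster assignment column
```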
p_clusts <-
tibble(k = 1:6) |>
mutate(
pclust = purrr::map(k, ~ kmeans(pen1, .x)),
ptidy = purrr::map(pclust, tidy),
pglance = purrr::map(pclust, glance),
paug = purrr::map(pclust, augment, pen))
p_clusters <- p_clusts |> unnest(cols = c(ptidy))
p_assigns <- p_clusts |> unnest(cols = c(paug))
p_clusterings <- p_clusts |> unnest(cols = c(pglance))

The goal here is to summarize the information contained in a large set of variables by means of a smaller set of “summary index” values that can be more easily visualized and analyzed.
Statistically, PCA finds lines, planes and hyper-planes in K-dimensional space that approximate the data as well as possible, specifically by identifying components that maximize the variance of the projected data.
… in regression analysis, the larger the number of explanatory variables allowed, the greater is the chance of overfitting the model, producing conclusions that fail to generalise to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called principal component regression. (Wikipedia)
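As an illustration of principal component regression in base R only (mtcars stands in as example data here; nothing in this block comes from the slides):

```r
# correlated predictors from base R's mtcars
x  <- mtcars[, c("disp", "hp", "wt", "drat")]

pc <- prcomp(x, scale. = TRUE)         # PCA on standardized predictors
scores <- as.data.frame(pc$x[, 1:2])   # keep only the first two components

# regress the response on the component scores instead of the raw predictors
fit <- lm(mtcars$mpg ~ PC1 + PC2, data = scores)
summary(fit)$r.squared
```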
Sure. See https://allisonhorst.github.io/palmerpenguins/articles/pca.html which is the basis for my next few slides.
We’ll build this within the tidymodels framework, and first use a few recipe steps to pre-process the data for PCA, specifically:
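A sketch of those pre-processing steps, based on the linked palmerpenguins article; the exact formula, step order, and object names here are assumptions:

```r
library(recipes)         # tidymodels pre-processing
library(palmerpenguins)

penguin_recipe <-
  recipe(~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
         data = penguins) |>
  step_naomit(all_predictors()) |>      # drop rows with missing measurements
  step_normalize(all_predictors()) |>   # center and scale before PCA
  step_pca(all_predictors(), id = "pca") |>
  prep()

penguin_pca <- tidy(penguin_recipe, id = "pca")  # loadings, 4 vars x 4 components
```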
# A tibble: 16 × 4
terms value component id
<chr> <dbl> <chr> <chr>
1 bill_length_mm 0.455 PC1 pca
2 bill_depth_mm -0.400 PC1 pca
3 flipper_length_mm 0.576 PC1 pca
4 body_mass_g 0.548 PC1 pca
5 bill_length_mm -0.597 PC2 pca
6 bill_depth_mm -0.798 PC2 pca
7 flipper_length_mm -0.00228 PC2 pca
8 body_mass_g -0.0844 PC2 pca
9 bill_length_mm -0.644 PC3 pca
10 bill_depth_mm 0.418 PC3 pca
11 flipper_length_mm 0.232 PC3 pca
12 body_mass_g 0.597 PC3 pca
13 bill_length_mm 0.146 PC4 pca
14 bill_depth_mm -0.168 PC4 pca
15 flipper_length_mm -0.784 PC4 pca
16 body_mass_g 0.580 PC4 pca
Standard deviations (1, .., p=4):
[1] 1.6594442 0.8789293 0.6043475 0.3293816
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
body_mass_g 0.5483502 0.084362920 0.5966001 0.5798821
bill_length_mm 0.4552503 0.597031143 -0.6443012 0.1455231
bill_depth_mm -0.4003347 0.797766572 0.4184272 -0.1679860
flipper_length_mm 0.5760133 0.002282201 0.2320840 -0.7837987
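The printout above is the standard display of base R's prcomp(); a sketch that should reproduce it (component signs are arbitrary and may flip between runs of different software):

```r
library(palmerpenguins)

# complete cases of the four body measurements
pen4 <- na.omit(penguins[, c("body_mass_g", "bill_length_mm",
                             "bill_depth_mm", "flipper_length_mm")])

prcomp(pen4, scale. = TRUE)  # prints standard deviations and the rotation matrix
```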
penguin_pca |>
mutate(terms = tidytext::reorder_within(terms,
abs(value),
component)) |>
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~component, scales = "free_y") +
tidytext::scale_y_reordered() +
scale_fill_manual(values = c("#b6dfe2", "#0A537D")) +
labs(
x = "Absolute value of contribution",
y = NULL, fill = "Positive?"
  )

USArrests data, from base R

The USArrests data set from base R is a table containing the number of arrests per 100,000 residents in each US state in 1973 for Murder, Assault and Rape, along with the percentage of the population in each state that lives in urban areas, called UrbanPop.
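The USArr73 tibble shown below can be built from USArrests with a couple of lines (a sketch; only the tibble package beyond base R):

```r
library(tibble)

# move the state names out of the row names into an explicit column
USArr73 <- USArrests |>
  rownames_to_column(var = "state") |>
  as_tibble()
USArr73
```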
USArr73 tibble
# A tibble: 50 × 5
state Murder Assault UrbanPop Rape
<chr> <dbl> <int> <int> <dbl>
1 Alabama 13.2 236 58 21.2
2 Alaska 10 263 48 44.5
3 Arizona 8.1 294 80 31
4 Arkansas 8.8 190 50 19.5
5 California 9 276 91 40.6
6 Colorado 7.9 204 78 38.7
7 Connecticut 3.3 110 77 11.1
8 Delaware 5.9 238 72 15.8
9 Florida 15.4 335 80 31.9
10 Georgia 17.4 211 60 25.8
# ℹ 40 more rows
USArr73_s <- USArr73 |> select(-state) |> scale()
row.names(USArr73_s) <- row.names(USArrests)
head(USArr73_s)
                Murder   Assault   UrbanPop         Rape
Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
Arizona 0.07163341 1.4788032 0.9989801 1.042878388
Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144 1.7589234 2.067820292
Colorado 0.02571456 0.3988593 0.8608085 1.864967207
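The 3-cluster fit below can be reproduced with base R alone; the seed is an assumption (not shown in the slides), so cluster labels and sizes may come out differently:

```r
USArr73_s <- scale(USArrests)   # base R equivalent of the scaling above
set.seed(2024)                  # assumed seed

arr_k3 <- kmeans(USArr73_s, centers = 3)
arr_k3
```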
K-means clustering with 3 clusters of sizes 13, 29, 8
Cluster means:
Murder Assault UrbanPop Rape
1 0.6950701 1.0394414 0.72263703 1.27693964
2 -0.7010700 -0.7071522 -0.09924526 -0.57773737
3 1.4118898 0.8743346 -0.81452109 0.01927104
Clustering vector:
Alabama Alaska Arizona Arkansas California
3 1 1 3 1
Colorado Connecticut Delaware Florida Georgia
1 2 2 1 3
Hawaii Idaho Illinois Indiana Iowa
2 2 1 2 2
Kansas Kentucky Louisiana Maine Maryland
2 2 3 2 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 2 3 1
Montana Nebraska Nevada New Hampshire New Jersey
2 2 1 2 2
New Mexico New York North Carolina North Dakota Ohio
1 1 3 2 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 2 2 3
South Dakota Tennessee Texas Utah Vermont
2 3 1 2 2
Virginia Washington West Virginia Wisconsin Wyoming
2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 19.922437 53.354791 8.316061
(between_SS / total_SS = 58.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
augment() from broom
# A tibble: 50 × 6
.rownames Murder Assault UrbanPop Rape .cluster
<chr> <dbl> <dbl> <dbl> <dbl> <fct>
1 Alabama 1.24 0.783 -0.521 -0.00342 3
2 Alaska 0.508 1.11 -1.21 2.48 1
3 Arizona 0.0716 1.48 0.999 1.04 1
4 Arkansas 0.232 0.231 -1.07 -0.185 3
5 California 0.278 1.26 1.76 2.07 1
6 Colorado 0.0257 0.399 0.861 1.86 1
7 Connecticut -1.03 -0.729 0.792 -1.08 2
8 Delaware -0.433 0.807 0.446 -0.580 2
9 Florida 1.75 1.97 0.999 1.14 1
10 Georgia 2.21 0.483 -0.383 0.488 3
# ℹ 40 more rows
tidy() and glance() from broom
# A tibble: 3 × 7
Murder Assault UrbanPop Rape size withinss cluster
<dbl> <dbl> <dbl> <dbl> <int> <dbl> <fct>
1 0.695 1.04 0.723 1.28 13 19.9 1
2 -0.701 -0.707 -0.0992 -0.578 29 53.4 2
3 1.41 0.874 -0.815 0.0193 8 8.32 3
432 Class 25 | 2024-04-23 | https://thomaselove.github.io/432-2024/